Language Determination: Natural Language Processing from Scanned Document Images

Authors

Penelope Sibun

A. Lawrence Spitz

Abstract

Many documents are available to a computer only as images from paper. However, most natural language processing systems expect their input as character-coded text, which may be difficult or expensive to extract accurately from the page. We describe a method for converting a document image into character shape codes and word shape tokens. We believe that this representation, which is both cheap and robust, is sufficient for many NLP tasks. In this paper, we show that the representation is sufficient for determining which of 23 languages the document is written in, using only a small number of features, with greater than 90% accuracy overall.

1 Introduction

Computational linguists work with texts. Computational linguistic applications range from natural language understanding to information retrieval to machine translation. Such systems usually assume the language of the text that is being processed. However, as corpora become larger and more diverse, this assumption becomes less warranted. Attention is now turning to the issue of determining the language or languages of a text before further processing is done.

Several sources of information for language determination have been tried: short words (Kulikowski 1991, Ingle 1976); n-grams of words (Batchelder 1992); n-grams of characters (Cavner & Trenkle 1994); diacritics and special characters (Beesley 1988, Newman 1987); syllable characteristics (Mustonen 1965); morphology and syntax (Ziegler 1991). Each of these approaches is promising, although none is completely accurate. More fundamentally, many rely on relatively large amounts of text data and all rely on data in the form of character codes (e.g., ASCII).

In today's world of text-based information, however, not all sources of text will be character coded. Many documents such as incoming faxes, patent applications, and office memos are only accessible on paper.
Processes such as Optical Character Recognition (OCR) have been developed for mapping paper documents into character-coded text. However, for applications like OCR, it is desirable to know the language a document is in before trying to decode its characters. There appears to be a fundamental Catch-22: natural language processing systems want to be able to work automatically with arbitrary documents, many of which may be available only on paper, and in the process, they minimally need to know which language or languages are present. The algorithms cited above can determine a document's language, but they require a character-coded representation of the text. OCR can produce such a representation, but OCR does not work well unless the language(s) of the document are known.

So how can the language of a paper document be determined? We have developed a method which reliably determines the language or languages of a document image. In this paper, we discuss Roman-alphabet languages such as English, Polish, and Swahili; see Spitz (1994) for a discussion of the determination of Asian-script languages.

Our method finesses the problems inherent in mapping from an image to a character-coded representation: we map instead from the image to a shape-based representation. The basal representation is the character shape code, of which there are a small number. These shape codes are aggregated into word shape tokens, which are delimited by white space. From examining these word shape tokens we can determine the language of the document. An example of the transformation from character codes to character shape codes is shown in Figure 1.

Character codes:
Confidence in the international monetary system was shaky enough before last week's action.

Character shape codes:
AxxAiAxxxx ix AAx ixAxxxxAixxxA xxxxAxxg xgxAxx xxx xAxAg xxxxgA AxAxxx AxxA xxxA'x xxAixx.

Figure 1: Character code representation and character shape code representation.
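The mapping illustrated in Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper derives shape codes from the document image, whereas this sketch starts from character-coded text, and the class assignments (capitals and ascender letters to `A`, descender letters to `g`, `i` kept as `i`, other lowercase letters to `x`, everything else passed through) are inferred from the example in Figure 1 rather than taken from a stated table. The treatment of `j`, the exact letter sets, and the function names `char_shape_code` and `word_shape_tokens` are hypothetical.

```python
# Letter classes inferred from the Figure 1 example (assumption, not the
# authors' published table).
ASCENDERS = set("bdfhklt")   # lowercase letters that rise above x-height
DESCENDERS = set("gpqy")     # lowercase letters that drop below the baseline


def char_shape_code(ch: str) -> str:
    """Map one character to a coarse shape code."""
    if ch.isupper() or ch in ASCENDERS:
        return "A"
    if ch in DESCENDERS:
        return "g"
    if ch in ("i", "j"):
        # 'i' keeps its own code in Figure 1; giving 'j' its own code
        # is an assumption made for this sketch.
        return ch
    if ch.islower():
        return "x"
    # Punctuation, digits, and anything else pass through unchanged.
    return ch


def word_shape_tokens(text: str) -> list[str]:
    """Split on white space and shape-code each word."""
    return ["".join(char_shape_code(c) for c in word) for word in text.split()]


if __name__ == "__main__":
    print(" ".join(word_shape_tokens("international monetary system")))
```

For instance, `word_shape_tokens("international monetary system")` yields the tokens `ixAxxxxAixxxA`, `xxxxAxxg`, and `xgxAxx`, matching the corresponding tokens in Figure 1.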
The shape-based representation of a document is proving to be a remarkably rich source of information. While our initial goal has been to use it for language identification, in support of downstream OCR pro-

Related resources

A new text-mining method for extracting user context information to improve search engine result ranking

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the ones most similar to the user ...

Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting

With due respect to the authors’ rights, plagiarism detection is one of the critical problems in the field of text mining that many researchers are interested in. This issue is considered a serious one in high academic institutions. There exist language-free tools which do not yield reliable results, since the special features of each language are ignored in them. Considering the paucit...

What do Journalists do with Documents? Field Notes for Natural Language Processing Researchers

Natural language processing and visualization systems have been proposed to help journalists analyze large sets of documents, but very little has been said on what journalists do with documents in practice. We review a collection of 15 stories completed with the Overview document mining platform, characterizing the source material and reporting tasks. The median document set contained 4,000 doc...

A methodology for document processing: separating text from images

This paper presents a methodology for document processing that separates text paragraphs from images. The methodology is based on the recognition of text characters and words for the efficient separation of text paragraphs from images, keeping their relationships for a possible reconstruction of the original page. The text separation and extraction is based on a hierarchical framing process. The...

Information Processing from Document Images

Analysis of document images for information extraction has become very prominent in the recent past. A wide variety of information that has conventionally been stored on paper is now being converted into electronic form for better storage and intelligent processing. This requires processing documents using image analysis algorithms. Document image analysis differs from the conventional image proces...

Publication date: 1994
